Problem Statement
Business Context
Buying and selling used phones and tablets used to happen on a handful of online marketplace sites. But the used and refurbished device market has grown considerably over the past decade: an IDC (International Data Corporation) forecast predicts that the used phone market will be worth $52.7 billion by 2023, with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used phones and tablets, which offer considerable savings compared with new models.
Refurbished and used devices continue to provide cost-effective alternatives to both consumers and businesses looking to save money on a purchase. There are plenty of other benefits associated with the used device market. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase. Third-party vendors/platforms, such as Verizon, Amazon, etc., provide attractive offers to customers for refurbished devices. Maximizing the longevity of devices through second-hand trade also reduces their environmental impact and helps in recycling and reducing waste. The impact of the COVID-19 outbreak may further boost this segment as consumers cut back on discretionary spending and buy phones and tablets only for immediate needs.
Objective
The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.
Data Description
The data contains the different attributes of used/refurbished phones and tablets. The data was collected in the year 2021. The detailed data dictionary is given below.
- brand_name: Name of manufacturing brand
- os: OS on which the device runs
- screen_size: Size of the screen in cm
- 4g: Whether 4G is available or not
- 5g: Whether 5G is available or not
- main_camera_mp: Resolution of the rear camera in megapixels
- selfie_camera_mp: Resolution of the front camera in megapixels
- int_memory: Amount of internal memory (ROM) in GB
- ram: Amount of RAM in GB
- battery: Energy capacity of the device battery in mAh
- weight: Weight of the device in grams
- release_year: Year when the device model was released
- days_used: Number of days the used/refurbished device has been used
- normalized_new_price: Normalized price of a new device of the same model in euros
- normalized_used_price: Normalized price of the used/refurbished device in euros
Import Necessary Libraries
# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# split the data into train and test
from sklearn.model_selection import train_test_split
# to build linear regression_model
from sklearn.linear_model import LinearRegression
# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# to build linear regression_model using statsmodels
import statsmodels.api as sm
# to compute VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
Load the Dataset
Rcel = pd.read_csv("/Users/Moses Okesola/Downloads/used_device_data.csv")
Rcel.head()
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.50 | yes | no | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 146.0 | 2020 | 127 | 4.307572 | 4.715100 |
| 1 | Honor | Android | 17.30 | yes | yes | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 213.0 | 2020 | 325 | 5.162097 | 5.519018 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 213.0 | 2020 | 162 | 5.111084 | 5.884631 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 480.0 | 2020 | 345 | 5.135387 | 5.630961 |
| 4 | Honor | Android | 15.32 | yes | no | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 185.0 | 2020 | 293 | 4.389995 | 4.947837 |
Rcel.T
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 3444 | 3445 | 3446 | 3447 | 3448 | 3449 | 3450 | 3451 | 3452 | 3453 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| brand_name | Honor | Honor | Honor | Honor | Honor | Honor | Honor | Honor | Honor | Honor | ... | Apple | Apple | Apple | Apple | Asus | Asus | Asus | Alcatel | Alcatel | Alcatel |
| os | Android | Android | Android | Android | Android | Android | Android | Android | Android | Android | ... | iOS | iOS | iOS | iOS | Android | Android | Android | Android | Android | Android |
| screen_size | 14.5 | 17.3 | 16.69 | 25.5 | 15.32 | 16.23 | 13.84 | 15.77 | 15.32 | 16.23 | ... | 10.34 | 15.37 | 12.9 | 15.27 | 16.74 | 15.34 | 15.24 | 15.8 | 15.8 | 12.83 |
| 4g | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | ... | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes |
| 5g | no | yes | yes | yes | no | no | no | no | no | no | ... | no | no | no | no | no | no | no | no | no | no |
| main_camera_mp | 13.0 | 13.0 | 13.0 | 13.0 | 13.0 | 13.0 | 8.0 | 13.0 | 13.0 | 13.0 | ... | 12.0 | 8.0 | 8.0 | 8.0 | NaN | NaN | 13.0 | 13.0 | 13.0 | 13.0 |
| selfie_camera_mp | 5.0 | 16.0 | 8.0 | 8.0 | 8.0 | 8.0 | 5.0 | 8.0 | 16.0 | 8.0 | ... | 7.0 | 7.0 | 7.0 | 7.0 | 24.0 | 8.0 | 8.0 | 5.0 | 5.0 | 5.0 |
| int_memory | 64.0 | 128.0 | 128.0 | 64.0 | 64.0 | 64.0 | 32.0 | 64.0 | 128.0 | 128.0 | ... | 64.0 | 64.0 | 64.0 | 64.0 | 128.0 | 64.0 | 128.0 | 32.0 | 32.0 | 16.0 |
| ram | 3.0 | 8.0 | 8.0 | 6.0 | 3.0 | 4.0 | 2.0 | 4.0 | 6.0 | 6.0 | ... | 3.0 | 4.0 | 4.0 | 4.0 | 8.0 | 6.0 | 8.0 | 3.0 | 2.0 | 2.0 |
| battery | 3020.0 | 4300.0 | 4200.0 | 7250.0 | 5000.0 | 4000.0 | 3020.0 | 3400.0 | 4000.0 | 4000.0 | ... | 1821.0 | 3969.0 | 3046.0 | 3110.0 | 6000.0 | 5000.0 | 4000.0 | 4000.0 | 4000.0 | 4000.0 |
| weight | 146.0 | 213.0 | 213.0 | 480.0 | 185.0 | 176.0 | 144.0 | 164.0 | 165.0 | 176.0 | ... | 148.0 | 226.0 | 188.0 | 194.0 | 240.0 | 190.0 | 200.0 | 165.0 | 160.0 | 168.0 |
| release_year | 2020 | 2020 | 2020 | 2020 | 2020 | 2020 | 2020 | 2020 | 2020 | 2020 | ... | 2020 | 2019 | 2019 | 2019 | 2019 | 2019 | 2018 | 2020 | 2020 | 2020 |
| days_used | 127 | 325 | 162 | 345 | 293 | 223 | 234 | 219 | 161 | 327 | ... | 235 | 298 | 351 | 208 | 325 | 232 | 541 | 201 | 149 | 176 |
| normalized_used_price | 4.307572 | 5.162097 | 5.111084 | 5.135387 | 4.389995 | 4.413889 | 3.878259 | 4.729421 | 4.886054 | 4.773224 | ... | 5.011901 | 5.340466 | 5.440164 | 5.100902 | 5.715316 | 4.492337 | 5.037732 | 4.35735 | 4.349762 | 4.132122 |
| normalized_new_price | 4.7151 | 5.519018 | 5.884631 | 5.630961 | 4.947837 | 5.060694 | 4.518958 | 5.188726 | 5.299916 | 5.07361 | ... | 5.875352 | 6.734175 | 6.713746 | 6.287933 | 7.059317 | 6.483872 | 6.251538 | 4.528829 | 4.624188 | 4.279994 |
15 rows × 3454 columns
Check the shape of the dataset
Rcel.shape
(3454, 15)
There are 3454 rows and 15 columns.
Get the information regarding column datatypes
Rcel.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3454 entries, 0 to 3453
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   brand_name             3454 non-null   object
 1   os                     3454 non-null   object
 2   screen_size            3454 non-null   float64
 3   4g                     3454 non-null   object
 4   5g                     3454 non-null   object
 5   main_camera_mp         3275 non-null   float64
 6   selfie_camera_mp       3452 non-null   float64
 7   int_memory             3450 non-null   float64
 8   ram                    3450 non-null   float64
 9   battery                3448 non-null   float64
 10  weight                 3447 non-null   float64
 11  release_year           3454 non-null   int64
 12  days_used              3454 non-null   int64
 13  normalized_used_price  3454 non-null   float64
 14  normalized_new_price   3454 non-null   float64
dtypes: float64(9), int64(2), object(4)
memory usage: 404.9+ KB
There are eleven numeric columns (nine float and two integer) and four string (object) columns in the dataset.
List all the columns in the dataset
Rcel.columns
Index(['brand_name', 'os', 'screen_size', '4g', '5g', 'main_camera_mp',
'selfie_camera_mp', 'int_memory', 'ram', 'battery', 'weight',
'release_year', 'days_used', 'normalized_used_price',
'normalized_new_price'],
dtype='object')
Get summary statistics for the dataset
Rcel.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| brand_name | 3454 | 34 | Others | 502 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| os | 3454 | 4 | Android | 3214 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| screen_size | 3454.0 | NaN | NaN | NaN | 13.713115 | 3.80528 | 5.08 | 12.7 | 12.83 | 15.34 | 30.71 |
| 4g | 3454 | 2 | yes | 2335 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5g | 3454 | 2 | no | 3302 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| main_camera_mp | 3275.0 | NaN | NaN | NaN | 9.460208 | 4.815461 | 0.08 | 5.0 | 8.0 | 13.0 | 48.0 |
| selfie_camera_mp | 3452.0 | NaN | NaN | NaN | 6.554229 | 6.970372 | 0.0 | 2.0 | 5.0 | 8.0 | 32.0 |
| int_memory | 3450.0 | NaN | NaN | NaN | 54.573099 | 84.972371 | 0.01 | 16.0 | 32.0 | 64.0 | 1024.0 |
| ram | 3450.0 | NaN | NaN | NaN | 4.036122 | 1.365105 | 0.02 | 4.0 | 4.0 | 4.0 | 12.0 |
| battery | 3448.0 | NaN | NaN | NaN | 3133.402697 | 1299.682844 | 500.0 | 2100.0 | 3000.0 | 4000.0 | 9720.0 |
| weight | 3447.0 | NaN | NaN | NaN | 182.751871 | 88.413228 | 69.0 | 142.0 | 160.0 | 185.0 | 855.0 |
| release_year | 3454.0 | NaN | NaN | NaN | 2015.965258 | 2.298455 | 2013.0 | 2014.0 | 2015.5 | 2018.0 | 2020.0 |
| days_used | 3454.0 | NaN | NaN | NaN | 674.869716 | 248.580166 | 91.0 | 533.5 | 690.5 | 868.75 | 1094.0 |
| normalized_used_price | 3454.0 | NaN | NaN | NaN | 4.364712 | 0.588914 | 1.536867 | 4.033931 | 4.405133 | 4.7557 | 6.619433 |
| normalized_new_price | 3454.0 | NaN | NaN | NaN | 5.233107 | 0.683637 | 2.901422 | 4.790342 | 5.245892 | 5.673718 | 7.847841 |
- The average normalized used device price is approximately 4.4 euros.
- The normalized used prices range from approximately 1.54 to 6.62 euros.
- The median normalized used price (approximately 4.4 euros) is comparable to the mean.
- The average normalized used price is about 16.6% lower than the average normalized new price.
- The ranges of the normalized used price and the normalized new price are similar, both spanning approximately 5 euros.
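The 16.6% figure can be verified directly from the means in the summary table above (values copied from the `describe()` output):

```python
# Sanity-check the price-gap claim using the column means from the summary table
mean_used = 4.364712   # mean of normalized_used_price
mean_new = 5.233107    # mean of normalized_new_price
gap_pct = (1 - mean_used / mean_new) * 100
print(f"Average normalized used price is {gap_pct:.1f}% lower than the new price")
```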
Check for duplicate values in the dataset
Rcel.duplicated().sum()
0
There are no duplicate values in the data.
Check for missing values in the dataset
Rcel.isnull().sum()
brand_name                 0
os                         0
screen_size                0
4g                         0
5g                         0
main_camera_mp           179
selfie_camera_mp           2
int_memory                 4
ram                        4
battery                    6
weight                     7
release_year               0
days_used                  0
normalized_used_price      0
normalized_new_price       0
dtype: int64
There are missing values in the following columns: main_camera_mp, selfie_camera_mp, int_memory, ram, battery, and weight.
Create a copy of the data so that the original data remains unchanged
df = Rcel.copy()
Exploratory Data Analysis (EDA) Summary
# Get the list of all numerical variables, including 'normalized_used_price'
numerical_vars_all = Rcel.select_dtypes(include=['float64', 'int64']).columns.tolist()
# Print the list of all numerical variables
print("All Numerical Variables:")
for var in numerical_vars_all:
print(var)
All Numerical Variables:
screen_size
main_camera_mp
selfie_camera_mp
int_memory
ram
battery
weight
release_year
days_used
normalized_used_price
normalized_new_price
# Extract features excluding the dependent variable 'normalized_used_price'
feature = Rcel.drop(columns=["normalized_used_price"])
# Check the extracted features
(feature.head()).T
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| brand_name | Honor | Honor | Honor | Honor | Honor |
| os | Android | Android | Android | Android | Android |
| screen_size | 14.5 | 17.3 | 16.69 | 25.5 | 15.32 |
| 4g | yes | yes | yes | yes | yes |
| 5g | no | yes | yes | yes | no |
| main_camera_mp | 13.0 | 13.0 | 13.0 | 13.0 | 13.0 |
| selfie_camera_mp | 5.0 | 16.0 | 8.0 | 8.0 | 8.0 |
| int_memory | 64.0 | 128.0 | 128.0 | 64.0 | 64.0 |
| ram | 3.0 | 8.0 | 8.0 | 6.0 | 3.0 |
| battery | 3020.0 | 4300.0 | 4200.0 | 7250.0 | 5000.0 |
| weight | 146.0 | 213.0 | 213.0 | 480.0 | 185.0 |
| release_year | 2020 | 2020 | 2020 | 2020 | 2020 |
| days_used | 127 | 325 | 162 | 345 | 293 |
| normalized_new_price | 4.7151 | 5.519018 | 5.884631 | 5.630961 | 4.947837 |
The features are the independent variables of the dataset.
Data visualization of numerical variables by histogram_boxplot
# Histogram-boxplot combined plots for numerical variables
def histogram_boxplot(Rcel, feature, figsize=(15, 10), kde=False, bins=None):
"""
Boxplot and histogram combined
Rcel: dataframe
feature: dataframe column
figsize: size of figure (default (15,10))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2,
sharex=True,
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
)
sns.boxplot(data=Rcel, x=feature, ax=ax_box2, showmeans=True, color="violet")
if bins:
sns.histplot(data=Rcel, x=feature, kde=kde, ax=ax_hist2, bins=bins)
else:
sns.histplot(data=Rcel, x=feature, kde=kde, ax=ax_hist2)
ax_hist2.axvline(Rcel[feature].mean(), color="green", linestyle="--") # Add mean to the histogram
ax_hist2.axvline(Rcel[feature].median(), color="black", linestyle="-") # Add median to the histogram
Data visualization of categorical variables by labelled barplot
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 2, 6))
else:
plt.figure(figsize=(n + 2, 6))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n],
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
normalized used price
Question: What does the distribution of normalized used device prices look like?
# the column that contains information about the normalized used prices
normalized_used_price_class = df["normalized_used_price"].unique()
normalized_used_price_class
array([4.30757245, 5.16209665, 5.11108377, ..., 5.03773152, 4.35734965,
4.34976167])
# plotting the distribution of normalized used device prices
plt.figure(figsize=(14, 7))
# Subplot 1: Histogram of normalized used device prices with count labels
plt.subplot(1, 2, 1)
ax1 = sns.histplot(df["normalized_used_price"], bins=20, color="skyblue", edgecolor="black")
plt.title("Distribution of Normalized Used Device Prices")
plt.xlabel("Normalized Used Price")
plt.ylabel("Frequency")
# Add count labels to the histogram
for rect in ax1.patches:
height = rect.get_height()
ax1.text(rect.get_x() + rect.get_width() / 2, height, height, ha='center', va='bottom')
# Subplot 2: Boxplot of normalized used device prices
plt.subplot(1, 2, 2)
ax2 = sns.boxplot(y="normalized_used_price", data=df, color="lightblue")
plt.title("Boxplot of Normalized Used Device Prices")
plt.tight_layout()
plt.show()
- The normalized used price distribution is approximately normal.
- The lowest normalized used device price is slightly above 1.5 euros.
- The lowest-frequency bin (a count of 1) occurs close to 2 euros.
- The highest normalized used device price is close to 7 euros.
- The highest-frequency bin (a count of 686) occurs between 4.3 and 4.6 euros.
- Extreme normalized used prices occurred slightly below 3 euros and above 5.8 euros.
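The "extreme price" thresholds quoted above follow from the 1.5×IQR rule applied to the quartiles of normalized_used_price from the summary table:

```python
# Reproduce the boxplot outlier fences using the 1.5*IQR rule and the
# quartiles of normalized_used_price from the describe() output above
q1, q3 = 4.033931, 4.755700
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(f"Outlier fences: below {lower_fence:.2f} or above {upper_fence:.2f} euros")
```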
OS-Operating Systems
# the column that contains information about the operating systems
os_types = df["os"].unique()
os_types
array(['Android', 'Others', 'iOS', 'Windows'], dtype=object)
# Plot the labeled barplot for the OS on which the device runs
labeled_barplot(df, "os", perc=True, n=10)
- Android accounts for 93.1% of the used device market.
- The "Others" OS category accounts for 4%.
- Windows accounts for 1.9%.
- iOS accounts for 1%.
Question: What percentage of the used device market is dominated by Android devices?
# Calculate the percentage of the used device market dominated by Android devices
total_devices = len(df)
android_devices = len(df[df["os"] == "Android"])
android_market_share = (android_devices / total_devices) * 100
# Create a vertical bar plot to visualize the market share
plt.figure(figsize=(8, 6))
plt.bar(
["Android Devices", "Other Devices"],
[android_market_share, 100 - android_market_share],
color=["skyblue", "lightcoral"],
)
plt.ylabel("Percentage (%)")
plt.title("Percentage of Used Device Market Dominated by Android Devices")
plt.ylim(0, 100) # Set y-axis limits from 0 to 100 for percentage scale
# Display the percentage values on the bars
for index, value in enumerate([android_market_share, 100 - android_market_share]):
plt.text(index, value, f"{value:.2f}%", ha="center", va="bottom")
plt.show()
The percentage of the used devices market dominated by Android devices is 93.05%, and that of other operating system categories is 6.95%.
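The per-OS market shares can also be computed in a single call with `value_counts(normalize=True)`. A minimal sketch on a toy frame (in the notebook, `df` would stand in for `toy`; the toy counts are illustrative, not the dataset's):

```python
import pandas as pd

# Toy frame mimicking the OS column; proportions chosen for illustration only
toy = pd.DataFrame({"os": ["Android"] * 93 + ["iOS"] * 1 + ["Windows"] * 2 + ["Others"] * 4})

# Percentage share of each OS category in one call
share = toy["os"].value_counts(normalize=True) * 100
print(share.round(1))  # Android leads by a wide margin
```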
RAM
# the column that contains information about the RAM
ram_sizes = df["ram"].unique()
ram_sizes
array([ 3. , 8. , 6. , 4. , 2. , 1. , 12. , 0.5 , 0.25,
0.03, nan, 0.02, 1.5 ])
# Plot the histogram and boxplot for amount of RAM
histogram_boxplot(df, "ram")
-The distribution of RAM is right-skewed with a few scattered extreme values on both sides of the plot.
Screen size
# Plot the histogram and boxplot for the screen size
histogram_boxplot(df, "screen_size")
The distribution of screen_size is right-skewed with some outliers on both sides of the plot.
Normalized new price
# the column that contains information about the normalized new prices
normalized_new_price_class = df["normalized_new_price"].unique()
normalized_new_price_class
array([4.71510025, 5.51901794, 5.88463072, ..., 6.25153773, 4.62418819,
4.27999391])
# plot the histogram-boxplot for Normalized price of a new device of the same model
histogram_boxplot(df, "normalized_new_price")
The normalized_new_price distribution is approximately normal, with outliers on both sides of the plot.
Main camera mp: the resolution of the rear camera
# the column that contains information about the resolution of the rear camera
main_camera_mp_types = df["main_camera_mp"].unique()
main_camera_mp_types
array([13. , 8. , 5. , 10.5 , 3.15, nan, 2. , 16. , 0.3 ,
12. , 14.5 , 48. , 3. , 21. , 1.3 , 13.1 , 24. , 0.08,
20.7 , 23. , 1. , 18. , 12.2 , 12.3 , 20. , 20.2 , 4. ,
12.5 , 10. , 6.5 , 6.7 , 41. , 20.1 , 12.6 , 16.3 , 22.6 ,
19. , 21.5 , 21.2 , 8.1 , 1.2 , 22.5 ])
# Plot the histogram and boxplot for the resolution of the rear camera
histogram_boxplot(df, "main_camera_mp")
The main_camera_mp distribution is right-skewed, with a minimum value close to 0 and a few extreme values reaching up to 48 megapixels.
Selfie camera mp: the resolution of the front camera
# the column that contains information about the resolution of the front camera
selfie_camera_mp_types = df["selfie_camera_mp"].unique()
selfie_camera_mp_types
array([ 5. , 16. , 8. , 32. , 2. , 0.3 , 13. , 14.5 , 24. ,
10. , 1.3 , 20. , 25. , 0. , 12. , 7. , 1.2 , 3. ,
2.1 , 1. , 4. , 1.1 , nan, 10.5 , 1.6 , 1.8 , 3.5 ,
0.9 , 1.25, 18. , 1.9 , 9. , 3.7 , 16.3 , 5.1 , 2.2 ,
14. , 0.65])
# Plot the histogram and boxplot for the resolution of the front camera
histogram_boxplot(df, "selfie_camera_mp")
- The selfie_camera_mp distribution is right-skewed, with a minimum value of 0 and a maximum of 32 megapixels.
- There are a few extreme values to the right of the distribution.
Internal Memory
# the column that contains information about the internal memory
int_memory_sizes = df["int_memory"].unique()
int_memory_sizes
array([6.400e+01, 1.280e+02, 3.200e+01, 1.600e+01, 2.560e+02, 5.120e+02,
8.000e+00, 4.000e+00, 5.000e-01, 2.000e-01, 1.000e-01, 6.000e-02,
nan, 1.024e+03, 2.400e+01, 1.000e-02])
# Plot the histogram and boxplot for the internal memory of the device
histogram_boxplot(df, "int_memory")
The internal memory distribution is right-skewed with a few scattered extreme values. The smallest int_memory value is approximately zero.
Weight of the device in grams
# Plot the histogram and boxplot for the weight of the device
histogram_boxplot(df, "weight")
The distribution of device weight is right-skewed, with some very high extreme values on the right side of the plot and a few on the left.
battery: Energy capacity of the device battery in mAh
# Plot the histogram and boxplot for the Energy capacity of the device battery
histogram_boxplot(df, "battery")
The distribution of battery capacity is right-skewed, with many outliers on the right side of the plot.
Days used : Number of days the used/refurbished device has been used
# Plot the histogram and boxplot for the number of days the used/refurbished device has been used
histogram_boxplot(df, "days_used")
The distribution of the number of days the used/refurbished device has been used is slightly right-skewed, with no extreme values.
brand_name
# Plot the labeled barplot for the name of manufacturing brand
labeled_barplot(df, "brand_name", perc=True, n=10)
-The brand name distribution shows the percentage of market share of each manufacturing brand.
4g
# Plot the labeled barplot for 4G availability
labeled_barplot(df, "4g", perc=True, n=10)
- 67.6% of the used devices in the market support 4G.
- 32.4% of the used devices do not support 4G.
5g
# Plot the labeled barplot for 5G availability
labeled_barplot(df, "5g", perc=True, n=10)
- 4.4% of the used devices in the market support 5G.
- 95.6% of the used devices do not support 5G.
os
# Plot the labeled barplot for the operating system of the devices
labeled_barplot(df, "os", perc=True, n=10)
- Android devices account for 93.1% of the market, the highest share.
- iOS devices have the smallest share of the market.
release_year: Year when the device model was released
# Plot the labeled barplot for the year when the device model was released
labeled_barplot(df, "release_year", perc=True, n=10)
-The plot indicates that a higher percentage (18.6%) of phone and tablet devices were released in 2014 compared to other years in the range.
-There is a drop in the percentage of releases by the year 2020 (8%), suggesting a decrease in the number of device releases over time.
Bivariate Analysis
Correlation check
Question: Which attributes are highly correlated with the normalized price of a used device?
cols_list = df.select_dtypes(include=np.number).columns.tolist()
# dropping release_year as it is a temporal variable
cols_list.remove("release_year")
plt.figure(figsize=(15, 7))
sns.heatmap(
df[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
- normalized_used_price has a moderate positive correlation with screen_size, main_camera_mp, selfie_camera_mp, ram, and battery, and a strong positive correlation with normalized_new_price.
- The correlations between normalized_used_price and int_memory and weight are weakly positive.
- normalized_used_price has a weak negative correlation with days_used.
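To rank features by their correlation with the target programmatically, the relevant column of the correlation matrix can be sorted. A sketch on a small synthetic frame standing in for `df[cols_list]` (the synthetic values are illustrative only):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset: used price built from new price plus noise
rng = np.random.default_rng(0)
new_price = rng.normal(5.2, 0.7, 200)
toy = pd.DataFrame({
    "normalized_new_price": new_price,
    "normalized_used_price": new_price * 0.8 + rng.normal(0, 0.1, 200),
    "days_used": rng.uniform(91, 1094, 200),
})

# Correlation of every feature with the target, sorted strongest first
corr_with_target = (
    toy.corr()["normalized_used_price"]
    .drop("normalized_used_price")
    .sort_values(ascending=False)
)
print(corr_with_target)
```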
Plot the scatterplots for numerical features and the target variable, normalized_used_price
# Get the list of all numerical variables, excluding 'normalized_used_price' and 'release_year'
numerical_vars_all = df.select_dtypes(include=["float64", "int64"]).columns.tolist()
numerical_vars = [var for var in numerical_vars_all if var not in ["normalized_used_price", "release_year"]]
# Creating scatter plots for each numerical feature against the normalized used price
for feature in numerical_vars:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x=feature, y="normalized_used_price")
plt.title(f"Scatter plot of {feature} vs. Normalized Used Price")
plt.xlabel(feature)
plt.ylabel("Normalized Used Price")
plt.show()
Most numerical variables show only weak or no linear relationship with normalized_used_price; normalized_new_price shows the clearest positive linear trend, consistent with the correlation heatmap.
Plot of normalized used price vs categorical variables
# Create a box plot of the normalized used price versus categorical variables
cat_vars = df.select_dtypes(include=["object", "category"]).columns.tolist()
for feature in cat_vars:
plt.figure(figsize=(18, 8)) # Set the figure size for each plot
sns.boxplot(x=feature, y="normalized_used_price", data=df, width=0.8)
plt.title(f"Box Plot of Normalized Used Price by {feature}")
plt.xlabel(feature)
plt.ylabel("Normalized Used Price")
plt.xticks(rotation=90)
plt.show() # Display the plot in each iteration of the loop
- Apple used devices have the highest normalized used price, and Celkon devices have the lowest.
- Among operating systems, iOS devices have the highest normalized used price and the "Others" category the lowest.
- Devices with 4G attract a higher normalized used price than devices without 4G.
- Devices with 5G attract a higher normalized used price than devices without 5G.
Brand name vs RAM
Question: The amount of RAM is important for the smooth functioning of a device. How does the amount of RAM vary with the brand?
# Variation of RAM with the brand
plt.figure(figsize=(12, 6))
sns.boxplot(x="brand_name", y="ram", data=df)
plt.title("Variation of RAM with Brand")
plt.xlabel("Brand Name")
plt.ylabel("RAM (GB)")
plt.xticks(rotation=90)
plt.show()
- The OnePlus brand has the highest typical RAM, and Nokia the lowest.
- Many brands show extreme RAM values.
Brand name vs weight
Question: A large battery often increases a device's weight, making it feel uncomfortable in the hands. How does the weight vary for phones and tablets offering large batteries (more than 4500 mAh)?
# dataset for devices with large battery>4500 mAh
df_large_battery = df[df.battery > 4500]
df_large_battery.shape
(341, 15)
# Filter devices with large batteries (> 4500 mAh)
df_large_battery = df[df["battery"] > 4500]
# Create a box plot for 'brand_name' and 'weight' in the df_large_battery dataframe
plt.figure(figsize=(15, 5))
sns.boxplot(x="brand_name", y="weight", data=df_large_battery)
plt.xticks(rotation=90)
plt.title("Box Plot of Weight by Brand Name for Devices with Large Batteries")
plt.xlabel("Brand Name")
plt.ylabel("Weight (grams)")
plt.show()
- Among devices with large batteries, Samsung devices tend to be the heaviest while Micromax devices are the lightest.
- Outliers are noticeable for the Honor, Others, and Xiaomi brands.
Brand name vs Screen size
Question: Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones and tablets are available across different brands with a screen size larger than 6 inches?
# Convert screen size from inches to cm (1 inch = 2.54 cm)
df_large_screen = df[df.screen_size > 6 * 2.54]
df_large_screen.shape
(1099, 15)
# Call the labeled_barplot function to analyze the data
labeled_barplot(data=df_large_screen, feature="brand_name", perc=True, n=10)
- Huawei has the most used devices with a screen size larger than 6 inches.
- Motorola has the fewest among the brands shown.
brand name vs selfie camera mp
Question: A lot of devices nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of devices offering greater than 8MP selfie cameras across brands?
# Filter the original DataFrame for devices with selfie camera resolution greater than 8 megapixels
df_selfie_camera = df[df.selfie_camera_mp > 8]
df_selfie_camera.shape
(655, 15)
# Call the labeled_barplot function to analyze the data
labeled_barplot(data=df_selfie_camera, feature="brand_name", perc=True, n=10)
- Huawei has the most used devices with a selfie camera above 8 MP, while Meizu has the fewest.
main_camera_mp vs brand_name
Question: The main (rear) camera typically has a better resolution than the selfie (front) camera. What is the distribution of devices offering greater than 16MP main cameras across brands?
df_main_camera = df[df.main_camera_mp > 16]
df_main_camera.shape
(94, 15)
# Call the labeled_barplot function to analyze the data
labeled_barplot(data=df_main_camera, feature="brand_name", perc=True, n=10)
- Sony has the highest number of used devices with a main camera above 16 MP.
- BlackBerry has the lowest number.
Normalized used price vs Release year
# create lineplot of normalized_used_price and release_year
plt.figure(figsize=(12, 5))
sns.lineplot(x="release_year", y="normalized_used_price", data=df)
plt.show()
- The normalized used price increases with the release year: more recently released devices command higher prices.
Data Preprocessing
Treating Missing Values
# create a copy of the dataset to keep the original dataset unchanged
df1 = df.copy()
# checking for missing values
df1.isnull().sum()
brand_name                 0
os                         0
screen_size                0
4g                         0
5g                         0
main_camera_mp           179
selfie_camera_mp           2
int_memory                 4
ram                        4
battery                    6
weight                     7
release_year               0
days_used                  0
normalized_used_price      0
normalized_new_price       0
dtype: int64
# List columns with missing values in the DataFrame df1
columns_with_missing_values = df.columns[df.isnull().any()].tolist()
# Displaying the column names with missing values
for col in columns_with_missing_values:
print(col)
main_camera_mp
selfie_camera_mp
int_memory
ram
battery
weight
We will impute the missing values in the data by the column medians grouped by release_year and brand_name.
# Impute missing values in cols_impute with median by grouping the data on 'release_year' and 'brand_name'
for col in columns_with_missing_values:
df1[col] = df1[col].fillna(
value=df1.groupby(["release_year", "brand_name"])[col].transform("median")
)
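The groupby-transform-median pattern above can be seen end to end on a toy frame (the values below are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the imputation pattern used above
toy = pd.DataFrame({
    "brand_name":   ["A", "A", "A", "B", "B", "B"],
    "release_year": [2019, 2019, 2019, 2020, 2020, 2020],
    "battery":      [3000.0, np.nan, 4000.0, 5000.0, 5000.0, np.nan],
})

# transform("median") computes each (release_year, brand_name) group's median
# and broadcasts it back to the original row positions, so fillna only touches NaNs
toy["battery"] = toy["battery"].fillna(
    toy.groupby(["release_year", "brand_name"])["battery"].transform("median")
)
print(toy["battery"].tolist())
```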
# Check for missing values after imputing the above columns
missing_values_count = df1.isnull().sum()
print(missing_values_count)
brand_name                 0
os                         0
screen_size                0
4g                         0
5g                         0
main_camera_mp           179
selfie_camera_mp           2
int_memory                 0
ram                        0
battery                    6
weight                     7
release_year               0
days_used                  0
normalized_used_price      0
normalized_new_price       0
dtype: int64
We will impute the remaining missing values in the data by the column medians grouped by brand_name.
# Impute missing values in cols_impute with median by grouping the data on 'brand_name'
for col in columns_with_missing_values:
df1[col] = df1[col].fillna(
value=df1.groupby(['brand_name'])[col].transform("median")
)
# Check for missing values after imputing the above columns
missing_values_count = df1.isnull().sum()
print(missing_values_count)
brand_name                0
os                        0
screen_size               0
4g                        0
5g                        0
main_camera_mp           10
selfie_camera_mp          0
int_memory                0
ram                       0
battery                   0
weight                    0
release_year              0
days_used                 0
normalized_used_price     0
normalized_new_price      0
dtype: int64
We will fill the remaining missing values in the main_camera_mp column by the column median.
# Impute missing values in 'main_camera_mp' column with median
median_main_camera_mp = df1["main_camera_mp"].median()
df1["main_camera_mp"] = df1["main_camera_mp"].fillna(median_main_camera_mp)
# Check for missing values after imputing the "main_camera_mp" column
missing_values_count = df1.isnull().sum()
print(missing_values_count)
brand_name               0
os                       0
screen_size              0
4g                       0
5g                       0
main_camera_mp           0
selfie_camera_mp         0
int_memory               0
ram                      0
battery                  0
weight                   0
release_year             0
days_used                0
normalized_used_price    0
normalized_new_price     0
dtype: int64
All the missing values have been filled.
Feature Engineering
Handling Categorical Variables: Convert categorical variables like 'brand_name' and 'os' into numerical format using one-hot encoding or label encoding.
# One-hot encoding for the 'os' and 'brand_name' columns
df1_encoded = pd.get_dummies(df1, columns=['os', 'brand_name'], prefix=['os', 'brand_name'])
from sklearn.preprocessing import LabelEncoder
# Check if the columns '4g' and '5g' exist in the DataFrame
if "4g" in df1.columns and "5g" in df1.columns:
# Initialize LabelEncoder
label_encoder = LabelEncoder()
# Encode '4g' and '5g' columns; LabelEncoder assigns codes in sorted label order, so 'no' -> 0 and 'yes' -> 1
df1["4g_encoded"] = label_encoder.fit_transform(df1["4g"])
df1["5g_encoded"] = label_encoder.fit_transform(df1["5g"])
else:
print("Columns '4g' and/or '5g' not found in the DataFrame.")
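LabelEncoder assigns integer codes in sorted label order, which happens to give 'no' → 0 and 'yes' → 1 here. An explicit mapping makes that intent unambiguous and does not depend on which labels are present. A minimal sketch on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical stand-in for the '4g'/'5g' columns of df1
toy = pd.DataFrame({"4g": ["yes", "no", "yes"], "5g": ["no", "no", "yes"]})

# An explicit map guarantees 'yes' -> 1 and 'no' -> 0 regardless of label ordering
for col in ["4g", "5g"]:
    toy[f"{col}_encoded"] = toy[col].map({"no": 0, "yes": 1})
```

An unseen label would map to NaN, surfacing data issues instead of silently encoding them.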
Temporal Features:
Extract temporal features like 'age_of_device' from 'release_year' to capture the age of the device.
# Calculate age of the device based on release year
current_year = 2021
df1["age_of_device"] = current_year - df1["release_year"]
df1.drop("release_year", axis=1, inplace=True)
df1["age_of_device"].describe()
count    3454.000000
mean        5.034742
std         2.298455
min         1.000000
25%         3.000000
50%         5.500000
75%         7.000000
max         8.000000
Name: age_of_device, dtype: float64
Outlier detection and treatment
# outlier detection using boxplot
num_cols = df1.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 15))
for i, variable in enumerate(num_cols):
if i < 12: # Limit the subplots to 12
plt.subplot(3, 4, i + 1)
sns.boxplot(data=df1, x=variable)
plt.subplots_adjust(wspace=0.5, hspace=0.5) # Adjust the space between subplots
plt.show()
-There are notable extreme values in screen_size, main_camera_mp, selfie_camera_mp, int_memory, battery, weight, normalized_new_price, and normalized_used_price.
-weight and battery have especially high extreme values.
-The extreme values of the dummy variables are to be ignored.
Outlier treatment
# Select the numerical variables to screen for outliers
numerical_vars = df1.select_dtypes(include=np.number).columns.tolist()
# Calculate the median for the numerical variables
median = df1[numerical_vars].median()
# Create a copy of the original DataFrame
df1_outliers_corrected = df1.copy()
# Flag outliers as values more than 3 standard deviations from the median (adjust as needed)
lower_bound = median - 3 * df1[numerical_vars].std()
upper_bound = median + 3 * df1[numerical_vars].std()
outliers = (df1[numerical_vars] < lower_bound) | (df1[numerical_vars] > upper_bound)
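The cell above only flags outliers; it does not modify them. One treatment consistent with the bounds computed there is to cap (winsorize) values at the bounds using `DataFrame.clip`. A sketch on hypothetical battery values:

```python
import pandas as pd

# Hypothetical battery readings with one extreme value
toy = pd.DataFrame({"battery": [3000.0] * 9 + [20000.0]})

median = toy["battery"].median()
lower = median - 3 * toy["battery"].std()
upper = median + 3 * toy["battery"].std()

# Cap values outside [lower, upper] at the bounds instead of dropping rows
toy["battery_capped"] = toy["battery"].clip(lower=lower, upper=upper)
```

Note that heavy outliers inflate the standard deviation itself, so IQR-based bounds (e.g., 1.5 × IQR) are a common, more robust alternative.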
Data Preparation for Modeling
# separating independent variables from dependent variables
X = df1.drop(columns=["normalized_used_price"]) # independent variables
# target_variable for prediction
y = df1['normalized_used_price']
print(X.head())
print()
print(y.head())
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | days_used | normalized_new_price | 4g_encoded | 5g_encoded | age_of_device |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.50 | yes | no | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 146.0 | 127 | 4.715100 | 1 | 0 | 1 |
| 1 | Honor | Android | 17.30 | yes | yes | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 213.0 | 325 | 5.519018 | 1 | 1 | 1 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 213.0 | 162 | 5.884631 | 1 | 1 | 1 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 480.0 | 345 | 5.630961 | 1 | 1 | 1 |
| 4 | Honor | Android | 15.32 | yes | no | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 185.0 | 293 | 4.947837 | 1 | 0 | 1 |

0    4.307572
1    5.162097
2    5.111084
3    5.135387
4    4.389995
Name: normalized_used_price, dtype: float64
# let's add the intercept to data
X = sm.add_constant(X)
# creating dummy variables
X = pd.get_dummies(
X,
columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True,
) ## create dummies for independent features
X.head()
| const | screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | days_used | normalized_new_price | ... | brand_name_Spice | brand_name_Vivo | brand_name_XOLO | brand_name_Xiaomi | brand_name_ZTE | os_Others | os_Windows | os_iOS | 4g_yes | 5g_yes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 14.50 | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 146.0 | 127 | 4.715100 | ... | False | False | False | False | False | False | False | False | True | False |
| 1 | 1.0 | 17.30 | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 213.0 | 325 | 5.519018 | ... | False | False | False | False | False | False | False | False | True | True |
| 2 | 1.0 | 16.69 | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 213.0 | 162 | 5.884631 | ... | False | False | False | False | False | False | False | False | True | True |
| 3 | 1.0 | 25.50 | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 480.0 | 345 | 5.630961 | ... | False | False | False | False | False | False | False | False | True | True |
| 4 | 1.0 | 15.32 | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 185.0 | 293 | 4.947837 | ... | False | False | False | False | False | False | False | False | True | False |
5 rows × 51 columns
# Perform the train-test split with a 70:30 ratio
x_train, x_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Code is assuming 'X' contains the independent variables and 'y' is the dependent variable
# 'test_size=0.3' specifies a 70:30 split for the training and testing data
# 'random_state=42' ensures reproducibility of the split
# Print the number of rows in each resulting dataset
print("Rows in x_train:", x_train.shape[0])
print("Rows in x_test:", x_test.shape[0])
print("Rows in y_train:", y_train.shape[0])
print("Rows in y_test:", y_test.shape[0])
Rows in x_train: 2417
Rows in x_test: 1037
Rows in y_train: 2417
Rows in y_test: 1037
Exploratory Data Analysis
It is good practice to repeat exploratory data analysis after manipulating the data to gain insights, identify patterns, and confirm that the underlying structure of the data is unchanged.
Statistical summary of the dataset
df1.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| brand_name | 3454 | 34 | Others | 502 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| os | 3454 | 4 | Android | 3214 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| screen_size | 3454.0 | NaN | NaN | NaN | 13.713115 | 3.80528 | 5.08 | 12.7 | 12.83 | 15.34 | 30.71 |
| 4g | 3454 | 2 | yes | 2335 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5g | 3454 | 2 | no | 3302 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| main_camera_mp | 3454.0 | NaN | NaN | NaN | 9.617597 | 4.749438 | 0.08 | 5.0 | 8.0 | 13.0 | 48.0 |
| selfie_camera_mp | 3454.0 | NaN | NaN | NaN | 6.555067 | 6.96844 | 0.0 | 2.0 | 5.0 | 8.0 | 32.0 |
| int_memory | 3454.0 | NaN | NaN | NaN | 54.528474 | 84.934991 | 0.01 | 16.0 | 32.0 | 64.0 | 1024.0 |
| ram | 3454.0 | NaN | NaN | NaN | 4.03262 | 1.368598 | 0.02 | 4.0 | 4.0 | 4.0 | 12.0 |
| battery | 3454.0 | NaN | NaN | NaN | 3132.577446 | 1298.884193 | 500.0 | 2100.0 | 3000.0 | 4000.0 | 9720.0 |
| weight | 3454.0 | NaN | NaN | NaN | 182.636856 | 88.360445 | 69.0 | 142.0 | 160.0 | 185.0 | 855.0 |
| days_used | 3454.0 | NaN | NaN | NaN | 674.869716 | 248.580166 | 91.0 | 533.5 | 690.5 | 868.75 | 1094.0 |
| normalized_used_price | 3454.0 | NaN | NaN | NaN | 4.364712 | 0.588914 | 1.536867 | 4.033931 | 4.405133 | 4.7557 | 6.619433 |
| normalized_new_price | 3454.0 | NaN | NaN | NaN | 5.233107 | 0.683637 | 2.901422 | 4.790342 | 5.245892 | 5.673718 | 7.847841 |
| 4g_encoded | 3454.0 | NaN | NaN | NaN | 0.676028 | 0.468057 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 5g_encoded | 3454.0 | NaN | NaN | NaN | 0.044007 | 0.20514 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| age_of_device | 3454.0 | NaN | NaN | NaN | 5.034742 | 2.298455 | 1.0 | 3.0 | 5.5 | 7.0 | 8.0 |
-The mean values of screen_size, days_used, normalized_used_price, and normalized_new_price are unchanged after data manipulation.
-The standard deviation and range of screen_size, days_used, normalized_used_price, and normalized_new_price are also unchanged.
-There were slight changes in the mean and standard deviation of main_camera_mp, selfie_camera_mp, ram, battery, weight, and int_memory after imputation, but their medians remain the same.
-release_year has been replaced with age_of_device.
-4g_encoded and 5g_encoded are binary dummy variables, so their summary statistics reflect class proportions rather than magnitudes.
Check for missing values
df1.isnull().sum()
brand_name               0
os                       0
screen_size              0
4g                       0
5g                       0
main_camera_mp           0
selfie_camera_mp         0
int_memory               0
ram                      0
battery                  0
weight                   0
days_used                0
normalized_used_price    0
normalized_new_price     0
4g_encoded               0
5g_encoded               0
age_of_device            0
dtype: int64
There are no missing values.
histogram_boxplot(df, "normalized_used_price")
histogram_boxplot(df, "normalized_new_price")
histogram_boxplot(df, "screen_size")
histogram_boxplot(df, "main_camera_mp")
histogram_boxplot(df, "selfie_camera_mp")
histogram_boxplot(df, "int_memory")
histogram_boxplot(df, "ram")
histogram_boxplot(df, "weight")
histogram_boxplot(df, "battery")
histogram_boxplot(df, "days_used")
labeled_barplot(df, "brand_name", perc=True, n=10)
labeled_barplot(df, "os", perc=True, n=10)
labeled_barplot(df, "4g", perc=True, n=10)
labeled_barplot(df, "5g", perc=True, n=10)
labeled_barplot(df, "release_year", perc=True, n=10)
The distributions of both the numerical and categorical variables are essentially unchanged after data manipulation.
Bivariate Analysis
Correlation check
cols_list = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 7))
sns.heatmap(
df[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
-The correlations of normalized_used_price with the numerical variables are the same after data manipulation as before.
-normalized_used_price has a moderate positive correlation with screen_size, main_camera_mp, selfie_camera_mp, ram, and battery, and a strong positive correlation with normalized_new_price.
-The correlations of normalized_used_price with int_memory and weight are weakly positive.
-normalized_used_price has a weak negative correlation with days_used.
Model Building - Linear Regression
# Convert pandas DataFrames to numpy arrays for endogenous and exogenous variables
y_train_arr = y_train.values.astype(float) # Convert to float type
x_train_arr = x_train.values.astype(float) # Convert to float type
# Fit the OLS model using numpy arrays (feature names are lost, so the summary labels predictors x1, x2, ...)
olsmodel1 = sm.OLS(y_train_arr, x_train_arr)
ols_result = olsmodel1.fit()
# Print the summary of the OLS model
print(ols_result.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.849
Model: OLS Adj. R-squared: 0.846
Method: Least Squares F-statistic: 277.1
Date: Fri, 22 Mar 2024 Prob (F-statistic): 0.00
Time: 15:34:46 Log-Likelihood: 125.15
No. Observations: 2417 AIC: -152.3
Df Residuals: 2368 BIC: 131.4
Df Model: 48
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.4172 0.072 19.677 0.000 1.276 1.558
x1 0.0295 0.004 8.352 0.000 0.023 0.036
x2 0.0232 0.002 14.892 0.000 0.020 0.026
x3 0.0116 0.001 9.924 0.000 0.009 0.014
x4 0.0002 6.76e-05 2.779 0.005 5.53e-05 0.000
x5 0.0305 0.005 5.797 0.000 0.020 0.041
x6 -1.665e-05 7.35e-06 -2.266 0.024 -3.11e-05 -2.24e-06
x7 0.0008 0.000 5.946 0.000 0.001 0.001
x8 3.376e-05 3.07e-05 1.101 0.271 -2.64e-05 9.39e-05
x9 0.4104 0.012 33.494 0.000 0.386 0.434
x10 0.0203 0.008 2.514 0.012 0.004 0.036
x11 -0.0458 0.016 -2.846 0.004 -0.077 -0.014
x12 -0.0255 0.005 -5.589 0.000 -0.034 -0.017
x13 -0.0804 0.050 -1.618 0.106 -0.178 0.017
x14 -0.0438 0.148 -0.297 0.767 -0.333 0.246
x15 0.0068 0.049 0.138 0.890 -0.090 0.103
x16 0.0312 0.072 0.434 0.664 -0.110 0.172
x17 -0.2372 0.068 -3.485 0.001 -0.371 -0.104
x18 -0.0308 0.071 -0.434 0.664 -0.170 0.108
x19 -0.0130 0.059 -0.221 0.825 -0.128 0.102
x20 -0.1192 0.083 -1.442 0.149 -0.281 0.043
x21 -0.0403 0.050 -0.805 0.421 -0.138 0.058
x22 -0.0478 0.051 -0.941 0.347 -0.147 0.052
x23 -0.0599 0.046 -1.299 0.194 -0.150 0.030
x24 0.1338 0.113 1.179 0.238 -0.089 0.356
x25 -0.0592 0.068 -0.867 0.386 -0.193 0.075
x26 -0.0608 0.047 -1.299 0.194 -0.152 0.031
x27 -0.0230 0.063 -0.364 0.716 -0.147 0.101
x28 -0.0364 0.047 -0.771 0.441 -0.129 0.056
x29 -0.0860 0.056 -1.530 0.126 -0.196 0.024
x30 -0.0645 0.049 -1.315 0.189 -0.161 0.032
x31 0.0749 0.082 0.916 0.360 -0.085 0.235
x32 -0.0686 0.051 -1.349 0.177 -0.168 0.031
x33 0.0380 0.052 0.724 0.469 -0.065 0.141
x34 -0.0384 0.073 -0.523 0.601 -0.182 0.105
x35 -0.0294 0.049 -0.599 0.549 -0.126 0.067
x36 -0.0679 0.044 -1.556 0.120 -0.154 0.018
x37 -0.0427 0.062 -0.690 0.490 -0.164 0.078
x38 -0.0359 0.063 -0.568 0.570 -0.160 0.088
x39 -0.0617 0.045 -1.376 0.169 -0.150 0.026
x40 -0.0756 0.053 -1.428 0.153 -0.179 0.028
x41 -0.0356 0.068 -0.520 0.603 -0.170 0.099
x42 -0.0644 0.050 -1.277 0.202 -0.163 0.034
x43 -0.0781 0.057 -1.362 0.173 -0.191 0.034
x44 0.0325 0.050 0.655 0.513 -0.065 0.130
x45 -0.0460 0.048 -0.952 0.341 -0.141 0.049
x46 -0.0604 0.033 -1.856 0.064 -0.124 0.003
x47 -0.0374 0.043 -0.861 0.389 -0.122 0.048
x48 -0.0141 0.148 -0.095 0.924 -0.304 0.276
x49 0.0203 0.008 2.514 0.012 0.004 0.036
x50 -0.0458 0.016 -2.846 0.004 -0.077 -0.014
==============================================================================
Omnibus: 234.465 Durbin-Watson: 1.994
Prob(Omnibus): 0.000 Jarque-Bera (JB): 630.705
Skew: -0.536 Prob(JB): 1.11e-137
Kurtosis: 5.262 Cond. No. 1.07e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.48e-22. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
An R-squared of 0.849 means that ~85% of the variance in the dependent variable is explained by the independent variables, and the adjusted R-squared of 0.846 shows the fit holds up after accounting for the number of predictors.
The high F-statistic of 277.1 indicates that the regression is statistically significant, i.e., at least one independent variable has a significant effect on the dependent variable.
The very large condition number and the singular-matrix note, however, point to strong multicollinearity among the predictors, which we address below.
Model Performance Check
We shall check the performance of the model by calculating:
RMSE: Root Mean Squared Error
MAE: Mean Absolute Error
R-squared: Coefficient of Determination
Adjusted R-squared: a modification of R-squared that accounts for the number of predictors
MAPE: Mean Absolute Percentage Error
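To make the metrics concrete, here is a small worked example on hypothetical targets and predictions, computed with plain numpy (sklearn's metric functions would give the same values):

```python
import numpy as np

# Hypothetical targets and predictions (values illustrative)
y_true = np.array([4.0, 5.0, 6.0, 4.5])
y_pred = np.array([3.8, 5.2, 5.7, 4.6])

errors = y_true - y_pred
rmse = np.sqrt(np.mean(errors ** 2))           # root mean squared error
mae = np.mean(np.abs(errors))                  # mean absolute error
sse = np.sum(errors ** 2)
sst = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - sse / sst                             # coefficient of determination
mape = np.mean(np.abs(errors) / y_true) * 100  # mean absolute percentage error
```

Note that MAPE divides by the target, so it is unstable when targets approach zero; the normalized prices in this data are safely bounded away from zero.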
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
r2 = r2_score(targets, predictions)
n = predictors.shape[0]
k = predictors.shape[1]
return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
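A quick sanity check of the adjusted R-squared formula with hypothetical numbers: R² = 0.9 with n = 100 observations and k = 10 predictors gives 1 − 0.1·99/89, slightly below the raw R², as expected when the predictor count is penalized:

```python
# Same formula as adj_r2_score above, applied to hypothetical numbers
def adj_r2(r2, n, k):
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))

adj = adj_r2(0.9, 100, 10)  # 1 - 0.1 * 99 / 89, just under 0.889
```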
# function to compute MAPE
def mape_score(targets, predictions):
return np.mean(np.abs(targets - predictions) / targets) * 100
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
"""
Function to compute different metrics to check regression model performance
model: regressor
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
r2 = r2_score(target, pred) # to compute R-squared
adjr2 = adj_r2_score(predictors, target, pred) # to compute adjusted R-squared
rmse = np.sqrt(mean_squared_error(target, pred)) # to compute RMSE
mae = mean_absolute_error(target, pred) # to compute MAE
mape = mape_score(target, pred) # to compute MAPE
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"RMSE": rmse,
"MAE": mae,
"R-squared": r2,
"Adj. R-squared": adjr2,
"MAPE": mape,
},
index=[0],
)
return df_perf
# checking model performance on train set (seen 70% data)
olsmodel1 = LinearRegression()
olsmodel1.fit(x_train, y_train)
print("Training Performance\n")
olsmodel1_train_perf = model_performance_regression(olsmodel1, x_train, y_train)
olsmodel1_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.229761 | 0.178533 | 0.848887 | 0.845628 | 4.293664 |
# checking model performance on test set (seen 30% data)
olsmodel1 = LinearRegression()
olsmodel1.fit(x_train, y_train)
print("Test Performance\n")
olsmodel1_test_perf = model_performance_regression(olsmodel1, x_test, y_test)
olsmodel1_test_perf
Test Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.239062 | 0.188692 | 0.832176 | 0.823486 | 4.513288 |
-The train R-squared value is ~0.85, which indicates that the model is not underfitting.
-The test R-squared is close to the train R-squared (no significant drop), so the model generalizes well to unseen data.
-The train and test RMSE and MAE are comparable, so the model is not overfitting.
-The MAE suggests that the model predicts normalized_used_price with a mean absolute error of ~0.19 on test data.
-A MAPE of ~4.51 on test data means the predictions are, on average, within ~4.5% of the actual normalized used prices.
Checking Linear Regression Assumptions:
-No Multicollinearity
-Linearity of variables
-Independence of error terms
-Normality of error terms
-No Heteroscedasticity
Test for Multicollinearity
Define a function to check VIF
# Test for multicollinearity using the Variance Inflation Factor (VIF)
# Let's define a function to check VIF.
def checking_vif(predictors):
# Check and clean the predictors DataFrame
predictors = predictors.select_dtypes(include=[np.number]) # Select numeric columns only
vif = pd.DataFrame()
vif["feature"] = predictors.columns
# Calculate VIF for each feature
vif["VIF"] = [variance_inflation_factor(predictors.values, i) for i in range(len(predictors.columns))]
return vif
# Call the checking_vif function on the train data to check VIF
train_vif = checking_vif(x_train)
print(train_vif)
                 feature        VIF
0                  const  95.880541
1            screen_size   6.462515
2         main_camera_mp   2.055485
3       selfie_camera_mp   2.565127
4             int_memory   1.224326
5                    ram   2.008373
6                battery   3.836712
7                 weight   5.554191
8              days_used   2.472959
9   normalized_new_price   2.570473
10            4g_encoded   2.310097
11            5g_encoded   1.695259
12         age_of_device   4.505587
-There exists multicollinearity between some independent variables, as indicated by VIF values greater than 5 for screen_size and weight.
Removing Multicollinearity
-Drop, one by one, every column that has a VIF score greater than 5. A VIF above 5 signals high multicollinearity that may distort the interpretation of the coefficients.
-Look at the adjusted R-squared and RMSE of each of these models.
-Drop the variable whose removal causes the least change in adjusted R-squared.
-Check the VIF scores again.
-Continue until all VIF scores are under 5.
-Ignore the VIF values of the constant and the dummy variables; high values for these are expected and not a concern.
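For reference, the VIF of feature i is 1/(1 − R²ᵢ), where R²ᵢ comes from regressing feature i on the remaining predictors. A numpy-only sketch on synthetic collinear data (variable names and data are illustrative; statsmodels' variance_inflation_factor computes the same quantity):

```python
import numpy as np

def vif(X):
    """VIF_i = 1 / (1 - R^2_i), regressing column i on the other columns (with intercept)."""
    n, p = X.shape
    scores = []
    for i in range(p):
        y = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        scores.append(1.0 / (1.0 - r2))
    return scores

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)   # nearly collinear with a
c = rng.normal(size=200)             # independent
vifs = vif(np.column_stack([a, b, c]))  # vifs[0] and vifs[1] large, vifs[2] near 1
```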
# specify the column to drop
col_to_drop = "screen_size" ## Column to be dropped
# specify the train data from which to drop the column specified
x_train2 = x_train.loc[:, ~x_train.columns.str.startswith(col_to_drop)]
x_test2 = x_test.loc[:, ~x_test.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(x_train2)
print("VIF after dropping", col_to_drop)
vif
VIF after dropping screen_size
| feature | VIF | |
|---|---|---|
| 0 | const | 88.136335 |
| 1 | main_camera_mp | 2.044983 |
| 2 | selfie_camera_mp | 2.562639 |
| 3 | int_memory | 1.220129 |
| 4 | ram | 1.970646 |
| 5 | battery | 3.555564 |
| 6 | weight | 2.737859 |
| 7 | days_used | 2.467377 |
| 8 | normalized_new_price | 2.512565 |
| 9 | 4g_encoded | 2.309453 |
| 10 | 5g_encoded | 1.678568 |
| 11 | age_of_device | 4.389092 |
# checking model performance on train set (seen 70% data)
olsmodel1 = LinearRegression()
olsmodel1.fit(x_train, y_train)
print("Training Performance\n")
olsmodel1_train_perf = model_performance_regression(olsmodel1, x_train, y_train)
olsmodel1_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.229761 | 0.178533 | 0.848887 | 0.845628 | 4.293664 |
# checking model performance on test set (seen 30% data)
olsmodel1 = LinearRegression()
olsmodel1.fit(x_train, y_train)
print("Test Performance\n")
olsmodel1_test_perf = model_performance_regression(olsmodel1, x_test, y_test)
olsmodel1_test_perf
Test Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.239062 | 0.188692 | 0.832176 | 0.823486 | 4.513288 |
# specify the column to drop
col_to_drop = "weight" ## Column to be dropped
# specify the train data from which to drop the column specified
x_train2 = x_train.loc[:, ~x_train.columns.str.startswith(col_to_drop)]
x_test2 = x_test.loc[:, ~x_test.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(x_train2)
print("VIF after dropping", col_to_drop)
vif
VIF after dropping weight
| feature | VIF | |
|---|---|---|
| 0 | const | 87.709984 |
| 1 | screen_size | 3.185604 |
| 2 | main_camera_mp | 1.934964 |
| 3 | selfie_camera_mp | 2.531386 |
| 4 | int_memory | 1.224326 |
| 5 | ram | 1.970930 |
| 6 | battery | 3.473183 |
| 7 | days_used | 2.461660 |
| 8 | normalized_new_price | 2.567212 |
| 9 | 4g_encoded | 2.282841 |
| 10 | 5g_encoded | 1.687623 |
| 11 | age_of_device | 4.402847 |
# checking model performance on train set (seen 70% data)
olsmodel1 = LinearRegression()
olsmodel1.fit(x_train, y_train)
print("Training Performance\n")
olsmodel1_train_perf = model_performance_regression(olsmodel1, x_train, y_train)
olsmodel1_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.229761 | 0.178533 | 0.848887 | 0.845628 | 4.293664 |
# checking model performance on test set (seen 30% data)
olsmodel1 = LinearRegression()
olsmodel1.fit(x_train, y_train)
print("Test Performance\n")
olsmodel1_test_perf = model_performance_regression(olsmodel1, x_test, y_test)
olsmodel1_test_perf
Test Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.239062 | 0.188692 | 0.832176 | 0.823486 | 4.513288 |
After dropping screen_size and weight one at a time because of their VIF > 5, the remaining VIF values fall below 5. Note that the performance checks above refit the model on the full x_train, so they necessarily match the baseline metrics; refitting on x_train2/x_test2 would quantify the actual effect of each drop.
Dropping high p-value variables
Define a loop to drop high p-value variables one at a time
# Convert the 'y_train' Pandas Series to a NumPy array with appropriate data type
x_train_array = x_train.values.astype(float)
y_train_array = y_train.values.astype(float)
# Initial list of columns
cols = x_train.columns.tolist()
# Setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
# Defining the train set
x_train_aux = x_train_array[:, [x_train.columns.get_loc(col) for col in cols]]
# Fitting the model
model = sm.OLS(y_train_array, sm.add_constant(x_train_aux)).fit()
# Getting the p-values and the maximum p-value
p_values = model.pvalues
# Find the index of the variable with the maximum p-value
max_p_value_index = np.argmax(p_values)
max_p_value = p_values[max_p_value_index]
if max_p_value > 0.05:
feature_with_p_max = cols[max_p_value_index]
cols.remove(feature_with_p_max)
else:
break
selected_features = cols
print(selected_features)
['const', 'screen_size', 'main_camera_mp', 'selfie_camera_mp', 'int_memory', 'ram', 'battery', 'weight', 'normalized_new_price', '4g_encoded', '5g_encoded', 'age_of_device', 'brand_name_Celkon', 'brand_name_Nokia', 'brand_name_Xiaomi', 'os_Others', '4g_yes', '5g_yes']
# Specify the train and test data with selected features
x_train3 = x_train[selected_features]
x_test3 = x_test[selected_features]
# Convert Pandas Series 'y_train' to a NumPy array with a suitable data type
y_train_array = y_train.values.astype(float)
# Convert the x_train3 DataFrame (defined above) to a NumPy array with a suitable data type
x_train3_array = x_train3.values.astype(float)
# Fit OLS model on the updated dataset
olsmodel2 = sm.OLS(y_train_array, sm.add_constant(x_train3_array)).fit()
# Print the summary of the OLS model
print(olsmodel2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.847
Model: OLS Adj. R-squared: 0.846
Method: Least Squares F-statistic: 886.8
Date: Fri, 22 Mar 2024 Prob (F-statistic): 0.00
Time: 15:34:48 Log-Likelihood: 110.96
No. Observations: 2417 AIC: -189.9
Df Residuals: 2401 BIC: -97.27
Df Model: 15
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.3715 0.052 26.565 0.000 1.270 1.473
x1 0.0291 0.003 8.473 0.000 0.022 0.036
x2 0.0234 0.001 16.151 0.000 0.021 0.026
x3 0.0119 0.001 10.644 0.000 0.010 0.014
x4 0.0002 6.66e-05 2.836 0.005 5.83e-05 0.000
x5 0.0293 0.005 5.686 0.000 0.019 0.039
x6 -1.46e-05 7.19e-06 -2.030 0.043 -2.87e-05 -4.94e-07
x7 0.0008 0.000 6.200 0.000 0.001 0.001
x8 0.4092 0.011 36.760 0.000 0.387 0.431
x9 0.0230 0.008 3.023 0.003 0.008 0.038
x10 -0.0450 0.016 -2.862 0.004 -0.076 -0.014
x11 -0.0218 0.004 -5.847 0.000 -0.029 -0.014
x12 -0.1905 0.053 -3.571 0.000 -0.295 -0.086
x13 0.0823 0.030 2.771 0.006 0.024 0.140
x14 0.0829 0.025 3.296 0.001 0.034 0.132
x15 -0.0598 0.030 -1.993 0.046 -0.119 -0.001
x16 0.0230 0.008 3.023 0.003 0.008 0.038
x17 -0.0450 0.016 -2.862 0.004 -0.076 -0.014
==============================================================================
Omnibus: 240.704 Durbin-Watson: 1.993
Prob(Omnibus): 0.000 Jarque-Bera (JB): 674.212
Skew: -0.537 Prob(JB): 3.95e-147
Kurtosis: 5.354 Cond. No. 5.23e+19
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.02e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
An R-squared of 0.847 means that approximately 84.7% of the variance in the dependent variable is explained by the independent variables, and the adjusted R-squared of 0.846 shows the fit holds up after accounting for the number of predictors.
The high F-statistic of 886.8 indicates that the regression is statistically significant, i.e., at least one independent variable has a significant effect on the dependent variable.
Check the rest of the assumptions on olsmodel2.
Test of Linearity and Independence
# Create a DataFrame to store actual, fitted, and residual values
df_pred = pd.DataFrame()
# Add actual, fitted, and residual values to the DataFrame
df_pred["Actual Values"] = y_train # actual values
df_pred["Fitted Values"] = olsmodel2.fittedvalues # predicted values
df_pred["Residuals"] = olsmodel2.resid # residuals
# Display the first few rows of the DataFrame
df_pred.head()
| Actual Values | Fitted Values | Residuals | |
|---|---|---|---|
| 1744 | 4.261975 | 4.311553 | -0.049578 |
| 3141 | 4.175156 | 3.860832 | 0.314324 |
| 1233 | 4.117410 | 4.433246 | -0.315836 |
| 3046 | 3.782597 | 3.848331 | -0.065733 |
| 2649 | 3.981922 | 3.920911 | 0.061011 |
# let's plot the fitted values vs residuals
sns.residplot(
data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()
-The scatterplot above shows the distribution of the residuals (errors) versus the fitted (predicted) values.
-The plot shows no discernible pattern. Hence, the assumptions of linearity and independence are satisfied.
Test of Normality
# Plot the distribution of residuals as a histogram using Seaborn
sns.histplot(
data=df_pred, x="Residuals", kde=True
) # specify 'x' as 'Residuals' for the histogram
plt.title("Normality of residuals")
plt.show()
The distribution of residuals is close to normal, though not perfectly symmetric around zero; it is slightly left-skewed.
import pylab
import scipy.stats as stats
# Create a Q-Q plot to check the normality of residuals
stats.probplot(df_pred["Residuals"], dist="norm", plot=pylab)
plt.show()
-The residuals more or less follow a straight line, except at the tails.
-We shall confirm this with the Shapiro-Wilk test.
# Perform Shapiro-Wilk test of normality on the Residuals
stats.shapiro(df_pred["Residuals"])
ShapiroResult(statistic=0.963483452796936, pvalue=2.9948930717316565e-24)
-Because the p-value < 0.05, the Shapiro-Wilk test rejects exact normality of the residuals.
-With large samples, however, this test flags even minor deviations; the histogram and Q-Q plot show the distribution is approximately normal.
-We therefore treat the normality assumption as reasonably satisfied.
Test of Homoscedasticity
We will test for homoscedasticity using the Goldfeld-Quandt test.
If we get a p-value greater than 0.05, we can say that the residuals are homoscedastic.
Otherwise, they are heteroscedastic.
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ["F statistic", "p-value"]
# Define the selected features based on column names present in df_pred
selected_features = [
"Actual Values",
"Fitted Values",
] # Update this with appropriate columns from df_pred
# Apply the Goldfeld-Quandt test for heteroscedasticity
test = sms.het_goldfeldquandt(df_pred["Residuals"], df_pred[selected_features])
lzip(name, test)
[('F statistic', 0.48604057532391814), ('p-value', 0.9999999999999999)]
The p-value = 0.9999 > 0.05 indicates that the residuals are homoscedastic. Therefore, the assumption is satisfied.
Predictions on test data
pred = olsmodel2.predict(x_test3)
df_pred_test = pd.DataFrame({"Actual": y_test, "Predicted": pred})
df_pred_test.sample(10, random_state=1)
| Actual | Predicted | |
|---|---|---|
| 1174 | 4.553772 | 4.494382 |
| 2247 | 4.330996 | 4.10073 |
| 2991 | 4.180828 | 4.068603 |
| 2997 | 4.291007 | 4.238787 |
| 2982 | 4.649761 | 4.510048 |
| 479 | 4.284689 | 4.227465 |
| 17 | 4.777020 | 4.715704 |
| 471 | 4.721886 | 4.487669 |
| 2918 | 4.482550 | 4.501051 |
| 602 | 3.852485 | 4.161353 |
The actual and predicted values are close, indicating that the model's predictions are reasonably accurate.
Final Model Summary
# Define x_train_final based on x_train3
x_train_final = x_train3.copy()
# Convert Pandas Series 'y_train' and feature columns to float type
y_train_float = y_train.astype(float)
x_train_final_float = x_train_final.astype(float)
# Fit the OLS model on the updated dataset with float type data
olsmodel_final = sm.OLS(y_train_float, sm.add_constant(x_train_final_float)).fit()
# Print the summary of the OLS model
print(olsmodel_final.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.847
Model: OLS Adj. R-squared: 0.846
Method: Least Squares F-statistic: 886.8
Date: Fri, 22 Mar 2024 Prob (F-statistic): 0.00
Time: 16:58:43 Log-Likelihood: 110.96
No. Observations: 2417 AIC: -189.9
Df Residuals: 2401 BIC: -97.27
Df Model: 15
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 1.3715 0.052 26.565 0.000 1.270 1.473
screen_size 0.0291 0.003 8.473 0.000 0.022 0.036
main_camera_mp 0.0234 0.001 16.151 0.000 0.021 0.026
selfie_camera_mp 0.0119 0.001 10.644 0.000 0.010 0.014
int_memory 0.0002 6.66e-05 2.836 0.005 5.83e-05 0.000
ram 0.0293 0.005 5.686 0.000 0.019 0.039
battery -1.46e-05 7.19e-06 -2.030 0.043 -2.87e-05 -4.94e-07
weight 0.0008 0.000 6.200 0.000 0.001 0.001
normalized_new_price 0.4092 0.011 36.760 0.000 0.387 0.431
4g_encoded 0.0230 0.008 3.023 0.003 0.008 0.038
5g_encoded -0.0450 0.016 -2.862 0.004 -0.076 -0.014
age_of_device -0.0218 0.004 -5.847 0.000 -0.029 -0.014
brand_name_Celkon -0.1905 0.053 -3.571 0.000 -0.295 -0.086
brand_name_Nokia 0.0823 0.030 2.771 0.006 0.024 0.140
brand_name_Xiaomi 0.0829 0.025 3.296 0.001 0.034 0.132
os_Others -0.0598 0.030 -1.993 0.046 -0.119 -0.001
4g_yes 0.0230 0.008 3.023 0.003 0.008 0.038
5g_yes -0.0450 0.016 -2.862 0.004 -0.076 -0.014
==============================================================================
Omnibus: 240.704 Durbin-Watson: 1.993
Prob(Omnibus): 0.000 Jarque-Bera (JB): 674.212
Skew: -0.537 Prob(JB): 3.95e-147
Kurtosis: 5.354 Cond. No. 5.23e+19
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.02e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
# Check model performance on train set (70% of the data)
print(" Training Performance")
olsmodel_final_train_perf = model_performance_regression(
olsmodel_final, x_train_final, y_train
)
olsmodel_final_train_perf
Training Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.231114 | 0.179533 | 0.847102 | 0.845954 | 4.322087 |
# checking model performance on test set (30% of the data seen)
print("Test Performance\n")
olsmodel_final_test_perf = model_performance_regression(
olsmodel_final, x_test_final, y_test
)
olsmodel_final_test_perf
Test Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.238225 | 0.188703 | 0.833349 | 0.830402 | 4.513062 |
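The helper `model_performance_regression` used above is defined earlier in the notebook; its metrics can be reconstructed from their standard definitions. The following is a hypothetical numpy-only sketch (the function name `regression_metrics` and the return format are assumptions, not the notebook's actual helper):

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    # Standard regression metrics: RMSE, MAE, R^2, adjusted R^2, MAPE (%)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    resid = y_true - y_pred
    rmse = np.sqrt(np.mean(resid ** 2))
    mae = np.mean(np.abs(resid))
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)  # penalizes extra predictors
    mape = np.mean(np.abs(resid / y_true)) * 100
    return {"RMSE": rmse, "MAE": mae, "R-squared": r2,
            "Adj. R-squared": adj_r2, "MAPE": mape}

# Toy check: a perfect fit scores zero error and R^2 = 1
m = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0], n_features=2)
print(m)
```

Note that MAPE here is expressed in percent, which is why the tables above report values like 4.51 rather than 0.045.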
- The model is able to explain ~85% of the variation in the data.
- The train and test RMSE and MAE are low and comparable, suggesting that the model is not overfitting.
- The MAPE of ~4.5% on the test set suggests that predictions are, on average, within about 4.5% of the actual normalized used prices.
Actionable Insights and Recommendations
Insights
1. The model explains ~85% of the variation in the data, and test-set predictions are within ~4.5% of the actual normalized used prices on average. This indicates that the model is suitable for both prediction and inference.
2. If the normalized new price increases by one unit, the normalized used price increases by 0.4092 units, all other variables held constant.
3. If RAM increases by one unit, the normalized used price increases by 0.0293 units, all other variables held constant.
4. If screen size increases by one unit, the normalized used price increases by 0.0291 units, all other variables held constant.
5. If the main camera's resolution (in megapixels) increases by one unit, the normalized used price increases by 0.0234 units, all other variables held constant.
6. If the front (selfie) camera's resolution increases by one megapixel, the normalized used price increases by 0.0119 units, all other variables held constant.
7. If the age of a device (i.e., the years since release) increases by one year, the normalized used price decreases by 0.0218 units, all other variables held constant.
8. If the weight of a device increases by one unit, the normalized used price increases by 0.0008 units, all other variables held constant.
9. The positive coefficients suggest that, holding all other variables constant, a Nokia or Xiaomi device tends to have a normalized used price higher than other brands by 0.0823 and 0.0829 units, respectively.
10. Devices with 'os_Others' as the operating system have a normalized used price lower by 0.0598 units, all other variables held constant.
11. If the internal memory of a device increases by one unit, the normalized used price increases by 0.0002 units, all other variables held constant.
12. Variables such as 'screen_size', 'main_camera_mp', 'selfie_camera_mp', 'ram', 'weight', 'normalized_new_price', 'brand_name_Celkon', 'brand_name_Nokia', 'brand_name_Xiaomi', and 'os_Others' are statistically significant, as their p-values are close to or equal to 0.
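The "one-unit increase, all else constant" reading of these coefficients can be verified mechanically: in a linear model, changing one feature by one unit changes the prediction by exactly that feature's coefficient. A toy sketch using two coefficients and the intercept from the summary above (the feature values themselves are made up for illustration):

```python
# Coefficients mirrored from the OLS summary; feature values are hypothetical
coefs = {"const": 1.3715, "ram": 0.0293, "normalized_new_price": 0.4092}

def predict(ram, new_price):
    # Linear prediction: intercept + coefficient * feature value
    return (coefs["const"]
            + coefs["ram"] * ram
            + coefs["normalized_new_price"] * new_price)

base = predict(ram=4, new_price=5.0)
plus_one_ram = predict(ram=5, new_price=5.0)
print(round(plus_one_ram - base, 4))  # equals the ram coefficient, 0.0293
```

Because the other feature (here, the normalized new price) is held fixed, its contribution cancels out of the difference, which is exactly what "all other variables held constant" means.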
Recommendations
1. It is crucial to consider the comparative pricing of new devices when setting prices for used or refurbished devices.
2. Emphasizing devices with higher RAM might attract higher selling prices.
3. Considering devices with larger screens might enhance pricing strategies.
4. Devices with better camera capabilities may fetch a higher price. Highlighting camera quality can be a selling point.
5. Consider offering discounts or adjusting prices based on the age of the device to align with market demand for newer devices.
6. Lighter devices may be perceived as more convenient, potentially impacting pricing strategies to cater to user preferences.
7. Leveraging brand reputation and customer preference for brands such as Nokia and Xiaomi can help price used devices competitively.
8. Understanding customer preferences for different OS variations can help optimize pricing strategies to align with market demand.